Abstract

In light of the burgeoning interest in network analysis in the new millennium, we bring to the
attention of contemporary network theorists a two-stage double-standardization and hierarchical
clustering (single-linkage-like) procedure devised in 1974. Its many applications over the next
decade, primarily to the migration flows between geographic subdivisions within nations, often
revealed the presence of "hubs". These are, typically, "cosmopolitan/non-provincial" areas, such
as the French capital, Paris, which send and receive people relatively broadly across their
respective nations. Additionally, this two-stage procedure, which "might very well be the most
successful application of cluster analysis" (R. C. Dubes), has detected many (physically or
socially) isolated groups (regions) of areas, such as those forming the southern islands, Shikoku
and Kyushu, of Japan, the Italian islands of Sardinia and Sicily, and the New England region of
the United States. Further, we discuss a (complementary) approach developed in 1976, involving
the application of the max-flow/min-cut theorem to raw/non-standardized flows.
A. L. Barabasi, in his recent popular book, "Linked", asserts that the emergence of hubs
in networks is a surprising phenomenon that is "forbidden by both the Erdos-Renyi and
Watts-Strogatz models" [1, p. 63] [2, Chap. 8]. Here, we indicate an analytical framework
introduced in 1974 that the distinguished computer scientist R. C. Dubes, in a review of
[3], has asserted "might very well be the most successful application of cluster analysis"
[4, p. 142]. It has proved insightful in revealing, among other interesting relationships,
hub-like structures in networks of (weighted, directed) internodal flows. In the recent
resurgence of interest in network analysis, this methodology may have been overlooked, as many
of its uses had been reported in the 1970s and 1980s, in journals outside of the strictly
mathematical and physical literature [5, 6, 7, 8, 9, 10, 11, 12, 13, 14] (as well as in the
research institute monographs [15, 16, 17], widely distributed to academic libraries).

Though the principal procedure under discussion here is applicable in a wide variety of
social-science settings [3, 4], it has been largely used, in a demographic context, to study
the internal migration tables published at regular periodic intervals by most of the nations
of the world. These tables can be thought of as $N \times N$ (square) matrices, the entries
($m_{ij}$) of which are the number of people who lived in geographic subdivision $i$ at time
$t$ and $j$ at time $t+1$. (Some tables, but not all, have diagonal entries, $m_{ii}$, which
may represent the number of people who did move within area $i$, or simply those who lived in
$i$ at $t$ and $t+1$.)

In the first step of the analytical procedure employed, the rows and columns of the table
of flows are alternately (biproportionally [13]) scaled to sum to a fixed number (say 1).
Under broad conditions, to be discussed below, convergence occurs to a "doubly-stochastic"
(bistochastic) table, with row and column sums all simultaneously equal to 1 [18, 19, 20, 21].
The purpose of the scaling is to remove overall (marginal) effects of size, and focus on
relative, interaction effects. The cross-product ratios (relative odds),

$$\frac{m_{ij} m_{kl}}{m_{il} m_{kj}},$$

which are measures of association, are left invariant. Additionally, the entries of the
doubly-stochastic table provide maximum entropy estimates of the original flows, given the
row and column constraints [22].
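
The alternating scaling just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the FORTRAN or SAS code cited in this note: the function names `double_standardize` and `cross_product_ratio`, the tolerance, and the 3 x 3 table of flows are all hypothetical, and every entry is taken to be positive so that convergence is not an issue.

```python
def double_standardize(flows, tol=1e-12, max_iter=10_000):
    """Alternately rescale rows and columns until all sums are close to 1."""
    n = len(flows)
    table = [[float(x) for x in row] for row in flows]
    for _ in range(max_iter):
        # Row step: divide each row by its sum.
        for i in range(n):
            s = sum(table[i])
            table[i] = [x / s for x in table[i]]
        # Column step: divide each column by its sum.
        col_sums = [sum(table[i][j] for i in range(n)) for j in range(n)]
        for i in range(n):
            table[i] = [table[i][j] / col_sums[j] for j in range(n)]
        # Columns now sum to 1; stop once the rows do too.
        if max(abs(sum(row) - 1.0) for row in table) < tol:
            break
    return table


def cross_product_ratio(t, i, j, k, l):
    """Relative odds (t_ij * t_kl) / (t_il * t_kj)."""
    return (t[i][j] * t[k][l]) / (t[i][l] * t[k][j])


# Hypothetical 3 x 3 table of flows (made-up numbers, all entries positive).
flows = [[50, 120, 30],
         [200, 80, 50],
         [40, 60, 10]]

scaled = double_standardize(flows)
row_sums = [sum(row) for row in scaled]
col_sums = [sum(scaled[i][j] for i in range(3)) for j in range(3)]
ratio_before = cross_product_ratio(flows, 0, 1, 2, 0)
ratio_after = cross_product_ratio(scaled, 0, 1, 2, 0)
```

Because each step multiplies a whole row or a whole column by a constant, the multipliers cancel in the cross-product ratio, so `ratio_before` and `ratio_after` agree up to rounding.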

For large sparse flow tables, only the nonzero entries, together with their row and column
coordinates, are needed. Row and column (biproportional) multipliers can be iteratively
computed by sequentially accessing the nonzero cells [23]. If the table is "critically sparse",
various convergence difficulties may occur. Nonzero entries that are "unsupported" (that is,
not part of a set of $N$ nonzero entries, no two in the same row or column) may converge to
zero and/or the biproportional multipliers may not converge [3, p. 19] [24] [25, p. 171]. (The
scaling was successfully implemented with a 3,140 x 3,140 1965-70 intercounty migration
table for the United States [9, 15], as well as for a more aggregate 510 x 510 table for the
US [13]. Smoothing procedures can be used to modify the zero-nonzero structure of a flow table,
particularly if it is critically sparse [26, 27].) The "first strongly polynomial-time algorithm
for matrix scaling" was reported in [28].
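
For a sparse table, the same iteration can be carried out on the multipliers alone, visiting only the stored nonzero cells. The following sketch assumes a hypothetical dictionary-of-cells layout and a hypothetical function name `sparse_multipliers`; it also assumes every row and column has at least one nonzero entry and that no entry is "unsupported", so the difficulties mentioned above do not arise.

```python
def sparse_multipliers(entries, n, tol=1e-12, max_iter=10_000):
    """entries maps (row, col) -> positive flow; returns multipliers (r, c)."""
    r = [1.0] * n
    c = [1.0] * n
    for _ in range(max_iter):
        # Row multipliers: make each row of r_i * m_ij * c_j sum to 1.
        sums = [0.0] * n
        for (i, j), m in entries.items():
            sums[i] += m * c[j]
        r = [1.0 / s for s in sums]
        # Column multipliers: make each column sum to 1.
        sums = [0.0] * n
        for (i, j), m in entries.items():
            sums[j] += r[i] * m
        c = [1.0 / s for s in sums]
        # Columns are exact now; measure how far the rows are from 1.
        sums = [0.0] * n
        for (i, j), m in entries.items():
            sums[i] += r[i] * m * c[j]
        if max(abs(s - 1.0) for s in sums) < tol:
            break
    return r, c


# Hypothetical 3 x 3 table with an empty diagonal, stored as nonzero cells only.
entries = {(0, 1): 120.0, (0, 2): 30.0,
           (1, 0): 200.0, (1, 2): 50.0,
           (2, 0): 40.0, (2, 1): 60.0}

r, c = sparse_multipliers(entries, 3)
scaled = {(i, j): r[i] * m * c[j] for (i, j), m in entries.items()}
row_sums = [sum(v for (i, _), v in scaled.items() if i == k) for k in range(3)]
col_sums = [sum(v for (_, j), v in scaled.items() if j == k) for k in range(3)]
```

Only the multipliers and the original nonzero cells are kept in memory; the scaled table is formed at the end, cell by cell.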

In the second step of the procedure, the doubly-stochastic matrix is converted to a series
of directed (0,1) graphs (digraphs), by applying thresholds to its entries. As the thresholds
are progressively lowered, larger and larger strong components (a directed path existing
from any member of a component to any other) of the resulting graphs are found. This
process (a simple variant of well-known single-linkage [nearest neighbor] clustering) can be
represented by the familiar dendrogram or tree diagram used in hierarchical cluster analysis
and cladistics/phylogeny (cf. [29]).
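
The thresholding step can be illustrated with a deliberately naive sketch that recomputes the strong components from scratch at every threshold, using Kosaraju's two-pass search. It is not the improved algorithm of Tarjan referred to in this note; the function names and the small doubly-stochastic table are hypothetical.

```python
def strong_components(n, edges):
    """Kosaraju's algorithm; returns a list of frozensets of nodes."""
    fwd = [[] for _ in range(n)]
    rev = [[] for _ in range(n)]
    for i, j in edges:
        fwd[i].append(j)
        rev[j].append(i)

    def dfs(start, graph, seen, out):
        # Iterative depth-first search; nodes are appended in post-order.
        stack = [(start, iter(graph[start]))]
        seen.add(start)
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                stack.pop()
                out.append(node)

    order, seen = [], set()
    for v in range(n):
        if v not in seen:
            dfs(v, fwd, seen, order)
    comps, seen = [], set()
    for v in reversed(order):
        if v not in seen:
            comp = []
            dfs(v, rev, seen, comp)
            comps.append(frozenset(comp))
    return comps


def threshold_clustering(table):
    """Lower the threshold step by step and record each newly formed component."""
    n = len(table)
    levels = sorted({table[i][j] for i in range(n) for j in range(n)
                     if i != j and table[i][j] > 0}, reverse=True)
    history = []
    previous = {frozenset([v]) for v in range(n)}
    for t in levels:
        edges = [(i, j) for i in range(n) for j in range(n)
                 if i != j and table[i][j] >= t]
        current = set(strong_components(n, edges))
        for comp in sorted(current - previous, key=sorted):
            history.append((t, sorted(comp)))
        previous = current
    return history


# Hypothetical doubly-stochastic table: rows and columns each sum to 1.
table = [[0.0, 0.7, 0.2, 0.1],
         [0.7, 0.0, 0.1, 0.2],
         [0.2, 0.1, 0.0, 0.7],
         [0.1, 0.2, 0.7, 0.0]]

history = threshold_clustering(table)
# history == [(0.7, [0, 1]), (0.7, [2, 3]), (0.2, [0, 1, 2, 3])]
```

Each entry of `history` is one merge in the dendrogram: the two pairs form at the 0.7 level and join into a single strong component at 0.2.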

A FORTRAN implementation of the two-stage process was given in [30], as well as one
in the SAS (Statistical Analysis System) framework [31]. The noted computer scientist
R. E. Tarjan [32] devised an $O(M (\log N)^2)$ algorithm [33] and, then, a further improved
$O(M \log N)$ method [34], where $N$ is the number of nodes and $M$ the number of edges
of a directed graph. (These substantially improved upon the earlier works [30, 31], which
required the computations of transitive closures of graphs, and were $O(MN)$ in nature.)
A FORTRAN coding, involving linked lists, of the improved Tarjan algorithm [34] was
presented in [35], and applied in the US intercounty study [15]. If the graph-theoretic (0,1)
structure of the network under study is not strongly connected [36], independent analyses of
the subsystems of the network are appropriate.

The goodness-of-fit of the dendrogram generated to the doubly-stochastic table itself can
be evaluated, and possibly employed, it would seem, as an optimization criterion. Distances
between nodes in the dendrogram satisfy the (stronger than triangular) ultrametric
inequality, $d_{ij} \le \max(d_{ik}, d_{jk})$ [37, p. 245] [38, eq. (2.2)].
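
The ultrametric inequality is easy to check numerically. In the sketch below, the function name `is_ultrametric` and both distance matrices are hypothetical: `tree_d` stands for distances read off a dendrogram (the level at which two nodes first join), while `raw_d` stands for pairwise dissimilarities that have not been passed through a tree.

```python
from itertools import permutations


def is_ultrametric(d, eps=1e-12):
    """True if d[i][j] <= max(d[i][k], d[j][k]) for all distinct i, j, k."""
    n = len(d)
    return all(d[i][j] <= max(d[i][k], d[j][k]) + eps
               for i, j, k in permutations(range(n), 3))


# Hypothetical dendrogram distances: {0, 1} and {2, 3} each join at 0.3,
# and the two pairs join each other at 0.8.
tree_d = [[0.0, 0.3, 0.8, 0.8],
          [0.3, 0.0, 0.8, 0.8],
          [0.8, 0.8, 0.0, 0.3],
          [0.8, 0.8, 0.3, 0.0]]

# Hypothetical raw dissimilarities: d[0][3] = 0.9 exceeds
# max(d[0][2], d[3][2]) = 0.8, so the inequality fails.
raw_d = [[0.0, 0.3, 0.8, 0.9],
         [0.3, 0.0, 0.9, 0.8],
         [0.8, 0.9, 0.0, 0.3],
         [0.9, 0.8, 0.3, 0.0]]

tree_ok = is_ultrametric(tree_d)
raw_ok = is_ultrametric(raw_d)
```

The gap between a matrix such as `raw_d` and its tree counterpart `tree_d` is one possible basis for a goodness-of-fit measure; which measure to use is left open here.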

Geographic subdivisions (or groups of subdivisions) that enter into the bulk of the
dendrogram at the weakest levels are those with the broadest ties. Typically, these have been
found to be "cosmopolitan", hub-like areas, a prototypical example being the French capital,
Paris [3, sec. 4.1] [6]. Similarly, in parallel analyses of other internal migration tables, the
cosmopolitan/non-provincial natures of London, Barcelona, Milan, West Berlin, Moscow,
Manila, Bucharest, Montreal, Zurich, Santiago, Tunis and Istanbul were, among others,
highlighted in the respective dendrograms for their nations [3, sec. 8.2] [14, pp. 181-182]
[?, p. 55]. In the intercounty analysis for the US, the most cosmopolitan entities were: (1) the
centrally located paired Illinois counties of Cook (Chicago) and neighboring, suburban Du
Page; (2) the nation's capital, Washington, D.C.; and (3) the paired south Florida
(retirement) counties of Dade (Miami) and Broward (Ft. Lauderdale) [9, 15]. (In general, counties
with large military installations, large college populations, or that were state capitals also
interacted relatively broadly with other areas.)

It should be emphasized that although the indicated cosmopolitan areas may generally
have relatively large populations, this cannot, in and of itself, explain the wide national ties
observed, since the double-standardization, in effect, renders all areas of equal overall size.

Additionally, geographically isolated areas, such as the Japanese islands of Kyushu and
Shikoku, emerged as well-defined clusters (regions) of their constituent subdivisions
("prefectures" in the Japanese case) in the dendrograms (cf. [39], [?]), and similarly the Italian
islands of Sicily and Sardinia [?], and the North and South Islands of New Zealand and
Newfoundland (Canada) [?, p. 1]. The eight counties of Connecticut, and other New England
groupings, as further examples, were also very prominent in the highly disaggregated
US analysis [15]. Relatedly, in a study based solely upon the 1968 movement of college
students among the fifty states, the six New England states were strongly clustered
[11, Fig. 1].

Though quite successful, evidently, in revealing hub-like and clustering behavior in
recorded flows, the indicated series of studies did not address the recently-emerging,
theoretically-important issues of scale-free networks, power-law descriptions, network
evolution and vulnerability, and small-world properties that have been stressed by Barabasi [1]
(and his colleagues and many others in the growing field). (For critiques of these matters,
see [41, 42].) In this regard, one might, using the indicated two-stage procedure, compare
the hierarchical structure of geographic areas using internal migration tables at different
levels of geographic aggregation (counties, states, regions...). To again use the example of
France, based on a 1962-68 21 x 21 interregional table, Region Parisienne was certainly the
most hub-like [3, sec. 4.1], while using a finer 89 x 89 1954-62 interdepartmental table,
the dyad composed of Seine (that is, Paris and its immediate suburbs) together with the
encircling Seine-et-Oise (administratively eliminated in 1964) was most cosmopolitan
[3, sec. 6.1].
It would be of interest to develop a theory, making use of the rich mathematical structure
of doubly-stochastic matrices, by which the statistical significance of apparent hubs and
clusters in dendrograms could be evaluated [15, pp. 7-8] [43]. In the geographic context of
internal migration tables, where nearby areas have a strong distance-averse predilection
for binding, it seems unlikely that most clustering results generated could be considered to
be, in any standard sense, "random" in nature. On the other hand, other types of
"origin-destination" tables, such as those for occupational mobility [44], journal citations
[?, pp. 125-153], interindustry (input-output) flows [10], brand switches [3, sec. 9.6] [45], crime
switches [3, sec. 9.7] [?, Table XII] and (Morse code) confusions [3, sec. 9], among
others, clearly lack such a geographic dimension. An efficient algorithm, considered as a
nonlinear dynamical system, to generate random bistochastic matrices has recently been
presented [20] (cf. [48], [49]).



The creative, productive network analyst M. E. J. Newman has written: "Edge weights 
in networks have, with some exceptions . . . received relatively little attention in the physics 
[emphasis added] literature for the excellent reason that in any field one is well advised to 
look at the simple cases first (unweighted networks). On the other hand, there are many 
cases where edge weights are known for networks, and to ignore them is to throw out a lot 
of data that, in theory at least, could help us to understand these systems better" [50]. Of
course, the numerous applications of the two-stage procedure we have discussed above have, 
in fact, been to such weighted networks. 

In [50], Newman applied the famous Ford-Fulkerson max-flow/min-cut theorem to
weighted networks (which he mapped onto unweighted multigraphs). Earlier, this theorem
had been used to study Spanish [40], Philippine [51], and Brazilian, Mexican and Argentinian
[52] internal migration and US interindustry flows [?, pp. 18-28] [?], all the corresponding
flows now being left unadjusted, that is, not standardized. In this "multiterminal" approach,
the maximum flow, and the dual minimum edge cut-sets, between all ordered pairs of nodes
are found. Those cuts (often few or even null in number) which partition the $N$ nodes
nontrivially (that is, into two sets each of cardinality greater than 1) are noted. The set in
each such pair with the fewer nodes is regarded as a nodal cluster (region, in the geographic
context). It has the interesting, defining property that fewer people migrate into (from)
it, as a whole, than into (from) its node. In the Spanish context, the (nodal) province of
Badajoz was found to have a particularly large out-migration sphere of influence, and the
(Basque) province of Vizcaya (site of Bilbao and Guernica), an extensive in-migration field
[40].
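
A small sketch may help to fix ideas about the multiterminal approach. It uses a textbook breadth-first augmenting-path (Edmonds-Karp) routine rather than whatever code was used in the cited studies, and it reports only one minimum cut per ordered pair (the one whose source side is smallest), although minimum cuts need not be unique. The function names and the four-node table of raw flows are hypothetical.

```python
from collections import deque


def max_flow_min_cut(cap, s, t):
    """Return (max flow value, source side of a minimum cut) from s to t."""
    n = len(cap)
    residual = [row[:] for row in cap]
    total = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            # No path left: the nodes still reachable from s form the source side.
            return total, {v for v in range(n) if parent[v] != -1}
        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


def nontrivial_cuts(cap):
    """For every ordered pair, keep cuts with more than one node on each side."""
    n = len(cap)
    found = {}
    for s in range(n):
        for t in range(n):
            if s == t:
                continue
            value, side = max_flow_min_cut(cap, s, t)
            other = set(range(n)) - side
            if len(side) > 1 and len(other) > 1:
                smaller = min(side, other, key=len)
                found[(s, t)] = (value, sorted(smaller))
    return found


# Hypothetical raw (unstandardized) flows: heavy exchange inside {0, 1} and
# inside {2, 3}, with only a trickle between the two pairs.
cap = [[0, 100, 1, 0],
       [100, 0, 0, 1],
       [1, 0, 0, 100],
       [0, 1, 100, 0]]

cuts = nontrivial_cuts(cap)
# cuts[(0, 3)] == (2, [0, 1]); the pair (0, 1) yields only a trivial cut.
```

In this toy table every ordered pair that straddles the two groups gives the same nontrivial partition, while pairs inside a group are separated only by cutting off a single node.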



The networks formed by the World Wide Web and the Internet have been the focus
of much recent interest. Their structures are typically represented by $N \times N$
adjacency matrices, the entries of which are simply 0 or 1, rather than natural numbers, as
in internal migration and other flow tables. One might investigate whether the two-stage
double-standardization and hierarchical clustering, and the (complementary) multiterminal
max-flow/min-cut procedures we have sought to bring to the attention of the active body of
contemporary network theorists, could yield novel insights into these and other important
modern structures.

In closing, it might be of interest to describe the immediate motivation for this particular
note. I had done no further work applying the methods described above after 1985, being
aware of, but not absorbed in, recent developments in network analysis. In May 2008,
Mathematical Reviews asked me to review the book of Tom Siegfried [2], chapter 8 of which
is devoted to the on-going activities in network analysis. This further led me (thanks to D.
E. Boyce) to the book of Barabasi [1]. I then e-mailed Barabasi, pointing out the use of
the earlier, widely-applied clustering methods. In reply, he wrote, in part: "I guess you were
another demo of everything being a question of timing - after a quick look it does appear
that many things you did have came back as questions - with much more detailed data -
again in the network community today. No, I was not aware of your papers, unfortunately,
and it is hard to know how to get them back into the flow of the system". The present
note might be seen as an effort in that direction, alerting present-day investigators to these
demonstrably fruitful research methodologies.